Abstract

We study the problem of validating XML documents of size $N$ against general DTDs in the context
of streaming algorithms. The starting point of this work is a well-known space lower bound. There are
XML documents and DTDs for which $p$-pass streaming algorithms require $\Omega(N/p)$ space.

We show that when allowing access to external memory, there is a deterministic streaming algorithm
that solves this problem with memory space $O(\log^2 N)$, a constant number of auxiliary read/write
streams, and $O(\log N)$ total number of passes on the XML document and auxiliary streams.

An important intermediate step of this algorithm is the computation of the First-Child-Next-Sibling 
(FCNS) encoding of the initial XML document in a streaming fashion. We study this problem indepen- 
dently, and we also provide memory efficient streaming algorithms for decoding an XML document given 
in its FCNS encoding. 

Furthermore, validating XML documents encoding binary trees in the usual streaming model without
external memory can be done with sublinear memory. There is a one-pass algorithm using $O(\sqrt{N \log N})$
space, and a bidirectional two-pass algorithm using $O(\log^2 N)$ space performing this task.

1 Introduction 

The area of streaming algorithms has experienced tremendous growth over the last decade in many appli- 
cations. Streaming algorithms sequentially scan the whole input piece by piece in one pass, or in a small 
number of passes (i.e., they do not have random access to the input), while using sublinear memory space, 
ideally polylogarithmic in the size of the input. The design of streaming algorithms is motivated by the ex- 
plosion in the size of the data that algorithms are called upon to process in everyday real-time applications. 
Examples of such applications occur in bioinformatics for genome decoding, in Web databases for the search 
of documents, or in network monitoring. The analysis of Internet traffic PP, in which traffic logs are queried, 
was one of the first applications of this kind of algorithm. 

There are various extensions of this basic streaming model. One of them gives the streaming algorithm 
access to an external memory consisting of several read/write streams [HIIIIIIZ]. Then the streaming algorithm 
is also relaxed to perform multiple passes in any direction over the input stream and the auxiliary streams. 

'Supported by the French ANR SeSur and Defis programs under contracts ANR-07-SESU-013 (VERAP project) and ANR- 
08-EMER-012 (QRAC project). Christian Konrad is supported by a Fondation CFM-JP Aguilar grant. 
In most of the applications, the number of auxiliary streams is constant and the total number of passes is
logarithmic in the input size. 

Verifying properties or evaluating queries of massive databases is an active and challenging topic. For 
relational algebra queries against relational databases, the situation is quite clear. There are bidirectional
$O(\log N)$-pass deterministic streaming algorithms with constant memory space and a constant number of
auxiliary streams [S]. Moreover, the logarithmic number of passes is a necessary condition in order to keep
the memory space sublinear, even if randomization is allowed. The latter was initially stated for one-sided
error [8] and then extended to two-sided error [U [3].

In the context of data exchange, especially on the Web, the Extensible Markup Language (XML) is emerging
as the standard, and is currently drawing much attention in data management research. Little is known about
XML query processing when only streaming access to the XML document is allowed. For evaluating XQuery
and XPath queries against XML documents of size $N$, only the lower bound has been extended [H HI [5],
meaning that $\Omega(\log N)$ passes are necessary. For the upper bound, only simple refinements of the direct
algorithm are known: no auxiliary stream, one pass, and memory linear in the height of the XML document,
which in the worst case is as large as $N$.

This paper considers the problem of validating XML documents against a given Document Type Definition 
(DTD) in a streaming fashion without restrictions on the DTD. Prior works on that topic [HI [13 essentially 
try to characterize those DTDs for which validity can be checked by a finite-state automaton, that is a one- 
pass deterministic streaming algorithm with constant memory. Concerning arbitrary DTDs, two approaches 
have been considered in [18]. The first one leads to an algorithm with memory space linear in the height
of the XML document [TH]. The second one consists in constructing a refined DTD of at most quadratic
size, which defines a similar family of tree documents as the original one, and against which validation can
be done with constant space. Nonetheless, for an existing document and DTD, the latter requires that both
the document and the DTD be converted before validation.

One of the obstacles prior works had to cope with was checking well-formedness of XML documents, that
is, checking that every opening tag matches its same-level closing tag. Due to the past work of [13], we can
now perform such a verification with a constant-pass randomized streaming algorithm with sublinear memory
space and no auxiliary streams. In one pass the memory space is $O(\sqrt{N \log N})$, and it collapses to
$O(\log^2 N)$ with an additional pass in reverse direction.

The starting point of this work is the fact that checking validity is hard without auxiliary streams.
There are DTDs defining ternary XML documents for which any $p$-pass bidirectional randomized streaming
algorithm requires $\Omega(N/p)$ space. This lower bound comes from encoding a well-known communication
complexity problem, Set-Disjointness, as an XML validity problem. This lower bound should be well known;
however, we are not aware of a complete proof in the literature. In [10], a similar approach using ternary
trees with a reduction to Set-Disjointness is used for proving lower bounds for queries. For the sake of
completeness we provide a proof in the appendix.

An XML document is valid against a DTD if, for each node, the sequence of the labels of its children
fulfills a regular expression defined in the DTD. For the case of XML documents encoding binary trees,
we present in Section 3 two deterministic streaming algorithms for checking validity with sublinear space.
As a consequence, the presence of nodes of degree at least 3 is indeed a necessary condition for the linear
space lower bound for general documents. We first show how to design a one-pass algorithm with space
$O(\sqrt{N \log N})$ (Theorem 1). We conjecture that there is an $\Omega(\sqrt{N}/p)$ lower bound for $p$-pass
algorithms, which would render this algorithm tight up to a logarithmic factor, see Appendix B. With a second
pass in reverse direction the memory collapses to $O(\log^2 N)$ (Theorem 2). These two algorithms make use of
the simple but fundamental fact that in one pass over an XML document each node is seen twice, in the form of
its opening and closing tag. Hence, it is not necessary to remember all opening tags in the stream, since there
is a second chance to get the same information from their closing tags. Our algorithms exploit this observation.

Then, in Section 4 we present our main result. Corollary 1 states that validity of any XML document
against any DTD can be checked in the streaming model with external memory with poly-logarithmic space,
a constant number of auxiliary streams, and $O(\log N)$ passes over these streams. Validity of a node depends
on its children, hence it is crucial to have easy access to the sequence of children of any node. The fundamental
idea to establish this is, firstly, to compute the First-Child-Next-Sibling (FCNS) encoding, an encoding of the
XML document as a 2-ranked tree. In this encoding, the closing tags of the children of a
node are consecutive. The computation of this encoding is the hard part of the validation process, and the
resource requirements of our validation algorithm stem from this operation (Theorem [s]). Since the FCNS
encoding can be seen as a reordering of the tags of the original document, our strategy is to see this problem
as a sorting problem with a particular comparison function. Merge sort can be implemented as a streaming
algorithm, and we make use of it customized by an adapted merge function. The same idea can be used for
decoding with similar complexity (Theorem [t]).

Then, based on the FCNS encoding, verification can be done either in one pass and $O(\sqrt{N \log N})$ space
(Theorem^), or in two bidirectional passes and $O(\log^2 N)$ space (Theoremjsj). Concerning FCNS encoding
and decoding, we show a linear space lower bound for one-pass algorithms. For decoding, we present an
$O(\sqrt{N \log N})$ algorithm (Theorem [g]) that performs one pass over the input, but two passes over the
output. We conjecture for encoding that memory space remains $\Omega(N)$ after any constant number of passes,
which would show that decoding is easier than encoding.

This suggests a systematic use of the FCNS encoding for large documents since validity can be checked 
easily without auxiliary streams and in sublinear space. For user interactions, the original document can be 
obtained by the sublinear space 3-pass algorithm. The applicability of this idea is left as an open question. 



2 Preliminaries 

From now on, $\Sigma$ is a finite alphabet. The $k$-th letter of $X \in \Sigma^N$ is denoted by $X[k]$, for
$1 \le k \le N$, and the consecutive letters of $X$ between positions $i$ and $j$ by $X[i, j]$. A subsequence of
$X$ is any string $X[i_1]X[i_2] \ldots X[i_k]$, where $1 \le i_1 < i_2 < \cdots < i_k \le N$.



2.1 Streaming model 

In streaming algorithms, a pass over input $X \in \Sigma^N$ means that $X$ is given as input stream
$X[1], X[2], \ldots, X[N]$, which arrives sequentially, i.e., letter by letter in this order. Streaming algorithms
have access to random access memory space, and, as the case may be, to read/write external memory as
in [SlIS]. See also the review in [3]. We assume that any letter of $\Sigma$ fits into one cell of internal/external
memory. The external memory is a collection of auxiliary streams, which we see as read/write streams with,
again, sequential access. When needed, we augment the alphabet of auxiliary streams from $\Sigma$ by $k$-tuples of
elements in $\Sigma \cup [0, 2N]$, for some fixed constant $k$, which therefore fit in one cell of auxiliary streams.

At the beginning of each pass on a read/write stream, the algorithm decides whether it performs a read or 
write pass. The input stream is read-only. On a writing pass, the algorithm can either write a letter, and then 
move to the next cell, or move directly to the next cell. In the case of bidirectional streaming algorithms,
as opposed to unidirectional streaming algorithms where each pass is in the same order, the algorithm can decide
the direction of the sequential access.

For simplicity, we assume throughout this article that the length of the input is known in advance by the 
algorithm. Nonetheless, all our algorithms can be adapted to the case in which the length is unknown until 
the end of a pass. See [15] for an introduction to streaming algorithms.

Definition 1 (Streaming algorithm). A $p(N)$-pass streaming algorithm $A$ with $s(N)$ space, $k(N)$ auxiliary
streams, and $t(N)$ processing time per letter is an algorithm such that for every input stream $X \in \Sigma^N$:

1. $A$ has access to $k(N)$ auxiliary read/write streams,

2. $A$ performs in total at most $p(N)$ passes on $X$ and auxiliary streams,

3. $A$ maintains a memory space of size $s(N)$ letters of $\Sigma$ and bits while reading $X$ and auxiliary streams,

4. $A$ does not exceed a running time of $t(N)$ between two write or read operations.

We say that $A$ is bidirectional if it performs at least one pass in each direction. Otherwise $A$ is implicitly
unidirectional.
We do not mention the number of auxiliary streams when there are none ($k(N) = 0$). Furthermore, we
assume that operations on numbers in $[0, 2N]$ can be done in constant time.



2.2 XML documents 

We consider finite unranked ordered labeled trees $t$, where each tree node is labeled by some label in $\Sigma$, and
its root has a distinguished label $r$. Moreover, the children of every non-leaf node are ordered. From now on,
we omit the terms ordered and labeled. Then $k$-ranked trees are a special case where each node has at most
$k$ children. Binary trees are a special type of 2-ranked trees, where each node is either a leaf or has exactly
2 children.

For each label $a \in \Sigma$, we associate its corresponding opening tag $a$ and closing tag $\bar{a}$, standing for
$\langle a \rangle$ and $\langle /a \rangle$ in the usual XML notation. An XML sequence is a sequence over the alphabet
$\Sigma' = \{a, \bar{a} : a \in \Sigma\}$.
The XML sequence of a tree $t$ is the sequence of opening tags and closing tags in the order of a depth-first
traversal of $t$ (Figure 1): when at step $i$ we visit a node with label $a$ top-down (respectively bottom-up), we
let $X[i] = a$ (respectively $X[i] = \bar{a}$). Hence $X$ is a word over $\Sigma'$ of size twice the number of
nodes of $t$. The XML file describing $t$ is unique, and we denote it as XML($t$).

Figure 1: Let $\Sigma = \{a, b, c\}$, and let $t$ be the tree as above. Then
XML($t$) $= r\, b\, a\, \bar{a}\, a\, \bar{a}\, c\, \bar{c}\, \bar{b}\, b\, \bar{b}\, b\, a\, \bar{a}\, a\, \bar{a}\, \bar{b}\, c\, \bar{c}\, \bar{r}$.
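The definition of the XML sequence can be illustrated by a short sketch. It is not part of the paper's algorithms: the tree representation (a pair of a label and a list of children) and the convention of writing the closing tag of label a as the string "/a" are assumptions made here for illustration, and the example tree is our reading of the tree of Figure 1.

```python
def xml_sequence(tree):
    """Return XML(t) as a list of tags; "a" opens and "/a" closes label a."""
    label, children = tree
    seq = [label]                      # opening tag, emitted top-down
    for child in children:             # children are visited in order
        seq.extend(xml_sequence(child))
    seq.append("/" + label)            # closing tag, emitted bottom-up
    return seq


def leaf(label):
    """Hypothetical helper: a node without children."""
    return (label, [])


# Our reading of the example tree: r has children b, b, b, c.
t = ("r", [("b", [leaf("a"), leaf("a"), leaf("c")]),
           leaf("b"),
           ("b", [leaf("a"), leaf("a")]),
           leaf("c")])
```

Every node contributes exactly two tags, so the resulting sequence has twice as many letters as the tree has nodes.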

We assume that the input XML sequences $X$ are well-formed, namely $X$ = XML($t$) for some tree $t$. The
past work of [14] justifies this assumption, since checking well-formedness is at least as easy as any of our
algorithms for checking validity. Hence we could run an algorithm for well-formedness in parallel without
increasing the resource requirements. Note that randomness is necessary for checking well-formedness with
sublinear space, whereas we will show that randomness is not needed for validation.

Let us introduce some more useful notation. Since the length of a well-formed XML sequence is known in
advance, we will denote it by $2N$ instead of $N$. Each opening tag $X[i]$ and matching closing tag $X[j]$ in
$X$ = XML($t$) corresponds to a unique tree node $v$ of $t$. We sometimes denote $v$ either by $X[i]$ or $X[j]$. We
also write (ambiguously) $v$ for its corresponding opening tag, and $\bar{v}$ for its corresponding closing tag. Then,
the position of $v$ in $X$ is pos($v$) $= i$. Similarly, pos($\bar{v}$) $= j$.

We consider XML validity against some DTD. A DTD is a mapping $D$ from $\Sigma$ to regular expressions
over $\Sigma$. Let $t$ be a tree. Then a node $v \in t$ with label $v$ and children $v_1, v_2, \ldots, v_k$ with respective
labels $v_1, v_2, \ldots, v_k$ (we use the same symbol for a node and its label) is valid against $D$ if the word
$v_1 v_2 \ldots v_k$ satisfies the regular expression $D(v)$. In particular, $v$ can be
a leaf if and only if the empty word $\epsilon$ satisfies the regular expression $D(v)$. Then $t$ is valid against $D$ if all
its nodes are valid against $D$. Throughout the document we assume that DTDs are reasonably small and that
our algorithms have full access to them without counting them toward their space requirements.

Definition 2 (Validity). Let $D$ be some DTD. The problem Validity consists of deciding whether an
input tree $t$ given by its XML sequence XML($t$) on an input stream is valid against $D$.

We denote by Validity(2) the problem Validity restricted to input XML sequences describing binary 
trees. 
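As a point of reference for the streaming algorithms that follow, validity can be checked directly when the whole tree fits in memory. The sketch below is only such a baseline, not one of the paper's algorithms. It assumes single-character labels, a tree given as a pair of a label and a list of children, and a DTD given as a Python dictionary from labels to regular expressions in the syntax of Python's `re` module; all of these conventions are ours.

```python
import re


def is_valid(tree, dtd):
    """Check every node of `tree` against `dtd` (label -> regex over labels)."""
    pending = [tree]                   # explicit stack, avoids deep recursion
    while pending:
        label, children = pending.pop()
        # The word formed by the labels of the children, in order.
        word = "".join(child_label for child_label, _ in children)
        if re.fullmatch(dtd[label], word) is None:
            return False               # this node violates D(label)
        pending.extend(children)
    return True


# Hypothetical DTD: r has any number of b children followed by one c;
# b is a leaf or has children a, a and optionally c; a and c are leaves.
example_dtd = {"r": "b*c", "b": "(aac?)?", "a": "", "c": ""}
example_tree = ("r", [("b", [("a", []), ("a", []), ("c", [])]),
                      ("b", []),
                      ("b", [("a", []), ("a", [])]),
                      ("c", [])])
```

A leaf is valid exactly when the empty word matches its regular expression, which is the case for `a` and `c` in the example DTD. This baseline uses memory linear in the document, which is what the streaming algorithms avoid.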

3 Validity of binary trees 

For simplicity, we only consider binary trees in this section. A left opening/closing tag (respectively right
opening/closing tag) of an XML sequence $X$ is a tag whose corresponding node is the first child of its parent
(respectively second child).
Our algorithms for binary trees can be extended to 2-ranked trees. This requires a few changes in the
one-pass Algorithm 1 and the two-pass Algorithm 2 (in fact in the subroutine Algorithm 3), which we do not
describe here.

We now fix a DTD $D$, and assume that, in our algorithms, we have access to a procedure check($a, b, c$)
that signals invalidity and aborts if the word $bc$ does not satisfy the regular expression $D(a)$. Otherwise it
returns without any action.

3.1 One-pass algorithm 

In order to validate an XML document, we ensure validity of all tree nodes. For checking validity of a node
$v$ with two children, we have to relate 3 labels, that is, the label $v$ of the node itself and the labels of the
two children nodes $v_1, v_2$. In a top-down verification we use the opening tag $v$ of the parent node for
verification; in a bottom-up verification we use the closing tag $\bar{v}$ of the parent node. Algorithm 1 makes
use of the fact that there are these two chances to verify a node. It uses a stack onto which it pushes all
opening tags in order to perform top-down verifications once the information of the children nodes arrives
on the stream. Since $\bar{v}_1 v_2$ forms a substring of the input, top-down verification requires only the storage
of the opening tag $v$, because the labels of the children arrive in a block. The algorithm's space requirements
depend on a parameter $K$ (we optimize by setting $K = \sqrt{N \log N}$). Once the number of opening tags on
the stack is about to exceed $K$, we remove the bottom-most opening tag. The corresponding node will then
be verified bottom-up. Note that $\bar{v}_2 \bar{v}$ forms a substring of the input. Hence for bottom-up verifications it is
enough to store the label of the left child $v_1$ on the stack, since the label of the right child arrives in the form of
a closing tag right before the closing tag of the parent node. See Algorithm 1 for details.

For the unique identification of closing tags on the stack, we have to store them with their depth in the
tree. A stack item corresponding to a closing tag hence requires $O(\log N)$ space. Opening tags do not require
the storage of their depth (we store a depth of $-1$, which we assume to require only constant space).



Algorithm 1 Validity of binary trees in 1-pass
Require: input stream is a well-formed XML document
 1: $d \leftarrow 0$, $S \leftarrow$ empty stack
 2: $K \leftarrow \sqrt{N \log N}$
 3: while stream not empty do
 4:   $x \leftarrow$ next tag on stream
 5:   if $x$ is an opening tag $c$ then
 6:     if $x$ is a leaf then check($c, \epsilon, \epsilon$) end if
 7:     if $S$ has on top $(a, -1), (\bar{b}, d)$ then
 8:       check($a, b, c$); pop $S$ {Top-down verification}
 9:     end if
10:     if $|\{(a, -1) \in S \mid a \text{ opening}\}| \ge K$ then
11:       remove bottom-most $(a, -1)$ in $S$, $a$ opening
12:     end if
13:     $d \leftarrow d + 1$
14:     push $(x, -1)$
15:   else if $x$ is a closing tag $\bar{c}$ then
16:     $d \leftarrow d - 1$
17:     if $S$ has on top $(\bar{a}, d + 1), (\bar{b}, d + 1)$ then
18:       check($c, a, b$) {Bottom-up verification}
19:       pop $S$, pop $S$
20:     else if $S$ has on top $(\bar{b}, d + 1)$ then pop $S$
21:     end if
22:     if $S$ has on top $(c, -1)$ then pop $S$ end if
23:     push $(x, d)$
24:   end if
25: end while
The query in line 6 can be implemented by a lookahead of 1 on the stream. The opening tag $x$ corresponds
to a leaf only if the subsequent tag in the stream is the corresponding closing tag $\bar{x}$.
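The following Python sketch mirrors the steps of Algorithm 1 and may help when reading the pseudocode. It is a hedged illustration rather than the authors' implementation: the input is assumed to be a list of tags (opening tag "a", closing tag "/a") of a well-formed binary tree, labels are assumed to be single characters, the DTD is a dictionary of Python regular expressions, the rounding of $K$ is our choice, and the names `validate_one_pass` and `Invalid` are hypothetical. For brevity the sketch scans the stack to find the bottom-most opening tag, so it does not achieve the constant processing time per letter of the actual algorithm.

```python
import math
import re


class Invalid(Exception):
    """Raised by check() when a node violates the DTD."""


def validate_one_pass(tags, dtd, K=None):
    n_nodes = len(tags) // 2
    if K is None:  # K is about sqrt(N log N); the rounding is an assumption
        K = max(1, round(math.sqrt(n_nodes * math.log2(n_nodes + 1))))

    def check(a, b, c):
        if re.fullmatch(dtd[a], b + c) is None:
            raise Invalid(a)

    d, S = 0, []  # stack items are (label, is_opening, depth)
    for i, x in enumerate(tags):
        if not x.startswith("/"):                      # opening tag c
            c = x
            if tags[i + 1] == "/" + c:                 # lookahead of 1: leaf
                check(c, "", "")
            if len(S) >= 2 and S[-2][1] and not S[-1][1] and S[-1][2] == d:
                check(S[-2][0], S[-1][0], c)           # top-down verification
                S.pop()
            if sum(1 for item in S if item[1]) >= K:   # too many opening tags
                del S[next(j for j, item in enumerate(S) if item[1])]
            d += 1
            S.append((c, True, -1))
        else:                                          # closing tag of c
            c = x[1:]
            d -= 1
            if len(S) >= 2 and all(not it[1] and it[2] == d + 1 for it in S[-2:]):
                check(c, S[-2][0], S[-1][0])           # bottom-up verification
                S.pop()
                S.pop()
            elif S and not S[-1][1] and S[-1][2] == d + 1:
                S.pop()                                # right child, parent done
            if S and S[-1][1] and S[-1][0] == c:
                S.pop()                                # own opening tag
            S.append((c, False, d))
    return True
```

Calling the function with a small value of `K` forces bottom-up verifications, while a large value lets every node be verified top-down; both settings should accept and reject the same documents.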

Figure 2 visualizes the different cases with their stack modifications appearing in Algorithm 1.

[Figure 2 panels, one per condition, with the stack modification shown beneath each: line 6: $X \to X$;
line 7: $X a \bar{b} \to X a$; line 17: $X \bar{a} \bar{b} \to X$; line 20: $X \bar{b} \to X$; line 22: $X c \to X$.]

Figure 2: Visualization of the different conditions in Algorithm 1 with the applied stack modifications. $X$
represents the bottom part of the stack. Note that Algorithm 1 pushes the currently treated tag $c$ or $\bar{c}$ on
the stack in Line 14 or Line 23. $c$ or $\bar{c}$ corresponds to the highlighted node.

Fact 1 (not proved here) and Lemma 1 concern the structure of the stack $S$ used in Algorithm 1.

Fact 1. Let $S = (x_1, d_1), \ldots, (x_k, d_k)$ be the stack at the beginning of the while loop in line 3. Then:

1. pos($x_1$) $<$ pos($x_2$) $< \cdots <$ pos($x_k$),

2. depth($x_1$) $\le$ depth($x_2$) $\le \cdots \le$ depth($x_k$) $\le d$. Moreover, if depth($x_i$) $=$ depth($x_{i+1}$) then $x_i$ is the left
sibling of $x_{i+1}$,

3. The sequence $x_1 \ldots x_k$ satisfies the regular expression $a^* b^* (\epsilon \mid c \mid d e)$, where $a^*$ are left closing tags, $b^*$
are opening tags, $c$ is a closing tag, $d$ is a left closing tag, and $e$ is a right closing tag.

4. A left closing tag $\bar{a}$ is only removed from $S$ upon verification of its parent node.

Lemma 1. Let $S = (x_1, d_1), \ldots, (x_k, d_k)$ be the stack at the beginning of the while loop in line 3. Let
$(\bar{c}_i, d_i), (\bar{c}_{i+1}, d_{i+1})$ be two consecutive left closing tags in $S$ such that $(\bar{c}_{i+1}, d_{i+1})$ is not the topmost one.
Then pos($\bar{c}_{i+1}$) $\ge$ pos($\bar{c}_i$) $+ 2K$.

Proof. Denote by $X = X[1]X[2] \ldots X[2N]$ the input stream. Since $\bar{c}_{i+1}$ is not the topmost left closing tag
in $S$, the algorithm has already processed the right sibling opening tag $X[\text{pos}(\bar{c}_{i+1}) + 1]$ of $\bar{c}_{i+1}$. By Item 4
of Fact 1, no verification has been done of the parent of $\bar{c}_{i+1}$, since $\bar{c}_{i+1}$ is still in $S$. Therefore, the parent's
opening tag $X[k]$ of $\bar{c}_{i+1}$ has been deleted from $S$, where pos($\bar{c}_i$) $< k <$ pos($\bar{c}_{i+1}$). This can only happen if
at least $K$ opening tags have been pushed on $S$ between $X[k]$ and $\bar{c}_{i+1}$. Since these $K$ opening tags must
have been closed between $X[k]$ and $\bar{c}_{i+1}$, we obtain pos($\bar{c}_{i+1}$) $\ge$ pos($\bar{c}_i$) $+ 2K$. □

Fact 1 and Lemma 1 provide more insight into the stack structure and are used in the proof of Theorem 1.
Item 3 of Fact 1 states that the stack basically consists of a sequence of left closing tags, which are the left
children that are needed for bottom-up verifications of nodes that could not be verified top-down. This
sequence is followed by a sequence of opening tags for which we still aim at a top-down verification. The proof
of Lemma 1 explains the fact that the two sequences are strictly separated: a left closing tag $\bar{a}$ only remains
on the stack if at the moment of insertion there are no opening tags on the stack.

Theorem 1. Algorithm 1 is a one-pass streaming algorithm for Validity(2) with space $O(\sqrt{N \log N})$ and
$O(1)$ processing time per letter.

Proof. To prove correctness, we have to ensure validity of all nodes. Leaves are correctly validated upon
arrival of their opening tag in line 6. Concerning non-leaf nodes, firstly, note that all closing tags are pushed
on $S$ in line 23; in particular, all closing tags of left children appear on the stack. The algorithm removes left
closing tags only after validation of their parent node, no matter whether the verification was done top-down
or bottom-up, compare Item 4 of Fact 1. Emptiness of the stack after the execution of the algorithm follows
from Item 1 of Lemma 1 and hence implies the validation of all non-leaf nodes.

Figure 3: Visualization of the structure of the stack used in Algorithm 1. The stack fulfills the regular
expression $a^* b^* (\epsilon \mid c \mid d e)$, compare Item 3 of Fact 1. The $(a_i)_{i=1 \ldots k}$ are closing tags whose parents' nodes
were not verified top-down. For $j > i$, $a_j$ is connected to $a_i$ by the right sibling of $a_i$. The $(b_i)_{i=1 \ldots l}$ form a
sequence of opening tags such that $b_i$ is the parent node of $b_{i+1}$. On top of the stack might be one or two
closing tags depending on the current state of the verification process.

For the space bound, Line 10 guarantees that the number of opening tags in $S$ is always at most $K$. We
bound the number of closing tags on the stack by $\frac{N}{K} + 2$. Item 3 of Lemma 1 states that the stack contains
at most one right closing tag. From Item 4 of Lemma 1 we deduce that $S$ comprises at most $\frac{N}{K} + 1$ left
closing tags, since the stream is of length $2N$, and the distance in the stream of two consecutive left closing
tags that reside on $S$, except the top-most one, is at least $2K$. A closing tag with depth $(\bar{a}, d) \in \Sigma' \times [N]$
requires $O(\log N)$ space, an opening tag requires only constant space. Hence the total space requirements
are $O((\frac{N}{K} + 2) \log N + K)$, which is minimized for $K = \sqrt{N \log N}$.

Concerning the processing time per letter, the algorithm only performs a constant number of local stack
operations in one iteration of the while loop. □
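The choice of $K$ in the space bound can be made explicit by balancing the two terms of the bound. The short derivation below is our own elaboration, treating $K$ as a continuous parameter and ignoring constant factors hidden in the $O$-notation.

```latex
\[
  f(K) = \Bigl(\frac{N}{K} + 2\Bigr)\log N + K,
  \qquad
  f'(K) = -\frac{N \log N}{K^{2}} + 1 = 0
  \iff K = \sqrt{N \log N},
\]
\[
  f\bigl(\sqrt{N \log N}\bigr)
  = \sqrt{N \log N} + 2\log N + \sqrt{N \log N}
  = O\bigl(\sqrt{N \log N}\bigr).
\]
```

With this choice, the closing tags (each stored with a depth of $O(\log N)$ bits) and the opening tags (constant size each) contribute the same amount of space up to constants.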

Remark. Algorithm 1 can be turned into an algorithm with space complexity $O(\sqrt{D \log D})$, where $D$ is
the depth of the XML document. If $D$ is known beforehand, it is enough to set $K = \sqrt{D \log D}$ in line 2. If
$D$ is not known in advance, we make use of an auxiliary variable $D'$ storing a guess for the document depth.
Initially we set $D' = C$ for some constant $C$, we set $K = \sqrt{D' \log D'}$, and we run Algorithm 1. Each time
$d$ exceeds $D'$, we double $D'$, and we update $K$ accordingly.

This guarantees that the number of opening tags on the stack is limited by $O(\sqrt{D \log D})$. Since we
started with too small a guess for the document depth, we may have removed opening tags that would have
remained on the stack if we had chosen the depth correctly. This leads to further bottom-up verifications,
but no more than $O(\sqrt{D / \log D})$, guaranteeing $O(\sqrt{D \log D})$ space.

3.2 Two-pass algorithm 

The bidirectional two-pass Algorithm 2 uses a subroutine that checks in one pass validity of all nodes whose
left subtree is at least as large as their right subtree. Feeding into this subroutine the XML document read in
reverse direction and interpreting opening tags as closing tags and vice versa, it checks validity of all nodes
whose right subtree is at least as large as their left subtree. In this way all tree nodes get verified.

The subroutine performs only checks in a bottom-up fashion, that is, the verification of a node $v$ with
children $c_1, c_2$ makes use of the tags $\bar{c}_1$ and $c_2$ (which are adjacent in the XML document and hence easy to
recognize) and the closing tag $\bar{v}$ of $v$. When $\bar{c}_1 c_2$ appears in the stream, a 4-tuple consisting of $c_1$, $c_2$, depth($c_1$)
and pos($\bar{c}_1$) gets pushed on the stack. Upon arrival of $\bar{v}$, depth($c_1$) is needed to identify $c_1, c_2$ as the children
of $v$. pos($\bar{c}_1$) is needed for cleaning the stack: with the help of the pos values of the stack items, we identify
stack items whose parents' nodes have larger right subtrees than left subtrees, and these stack items get
removed from the stack. In so doing, we guarantee that the stack size does not exceed $\log(N)$ elements,
which is an exponential improvement over the one-pass algorithm.

Note that the reverse pass can be done independently of the first one, and thus possibly in parallel.



Algorithm 2 Two-pass algorithm validating binary trees
run Algorithm 3 reading the stream from left to right
run Algorithm 3 reading the stream from right to left, where opening tags are interpreted as closing tags, and
vice versa.



Algorithm 3 Validating nodes with size(left subtree) $\ge$ size(right subtree)
 1: $l \leftarrow 0$; $n \leftarrow 0$; $S \leftarrow$ empty stack
 2: while stream not empty do
 3:   $x \leftarrow$ next tag on stream (and move stream to next tag)
 4:   $y \leftarrow$ next tag on stream, without consuming it yet
 5:   $n \leftarrow n + 1$
 6:   if $x$ is an opening tag $c$ then
 7:     $l \leftarrow l + 1$
 8:     if $y = \bar{c}$ then check($c, \epsilon, \epsilon$) end if
 9:   else {$x$ is a closing tag $\bar{c}$}
10:     $l \leftarrow l - 1$
11:     if $S$ has on top $(\cdot, \cdot, l + 1, \cdot)$ then
12:       $(a, b, \cdot, \cdot) \leftarrow$ pop from $S$; check($c, a, b$)
13:     end if
14:     if $y$ is an opening tag $d$ then
15:       push $(c, d, l, n)$ to $S$
16:     end if
17:   end if
18:   while there is $s_1 = (\cdot, \cdot, \cdot, n_1)$ just below $s_2 = (\cdot, \cdot, \cdot, n_2)$ in $S$ with $n - n_2 > n_2 - n_1$ do
19:     suppress $s_2$ from $S$
20:   end while
21: end while
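Algorithms 2 and 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: tags are strings (opening "a", closing "/a"), labels are single characters, the DTD is a dictionary of Python regular expressions, and the names `check_heavy_left`, `validate_two_pass`, `flip`, and `Invalid` are hypothetical. One detail is made explicit here that the pseudocode leaves implicit: in the reverse pass the two children are encountered in swapped order, so the sketch swaps the last two arguments of check. The reverse pass is simulated on a reversed copy of the list, which a real streaming implementation would replace by reading the stream backwards.

```python
import re


class Invalid(Exception):
    """Raised when a node violates the DTD."""


def flip(tag):
    """Interpret an opening tag as a closing tag and vice versa."""
    return tag[1:] if tag.startswith("/") else "/" + tag


def check_heavy_left(tags, check):
    """Sketch of Algorithm 3 on a list of tags."""
    l, S = 0, []                       # S holds tuples (c, d, depth, position)
    for n, x in enumerate(tags, start=1):
        y = tags[n] if n < len(tags) else None     # lookahead, not consumed
        if not x.startswith("/"):
            l += 1
            if y == "/" + x:
                check(x, "", "")                   # leaf
        else:
            c = x[1:]
            l -= 1
            if S and S[-1][2] == l + 1:
                a, b, _, _ = S.pop()
                check(c, a, b)                     # bottom-up verification
            if y is not None and not y.startswith("/"):
                S.append((c, y, l, n))
        i = 1                                      # line 18: clean the stack
        while i < len(S):
            if n - S[i][3] > S[i][3] - S[i - 1][3]:
                del S[i]                           # re-test the new neighbour
            else:
                i += 1


def validate_two_pass(tags, dtd):
    def check(a, b, c):
        if re.fullmatch(dtd[a], b + c) is None:
            raise Invalid(a)

    check_heavy_left(tags, check)
    mirrored = [flip(tag) for tag in reversed(tags)]
    # Children appear in reverse order in the mirrored document: swap them.
    check_heavy_left(mirrored, lambda a, b, c: check(a, c, b))
    return True
```

Because the bottom-most stack item is never deleted, small documents are often fully verified by the forward pass alone; the reverse pass matters for nodes whose right subtree is much larger than the left one.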



Figure 4 visualizes the different cases in Algorithm 3.

[Figure 4 panels, one per condition: line 8, line 11, line 14.]

Figure 4: Visualization of the different conditions in Algorithm 3. The incoming tag $x$ corresponds to the
highlighted node.

We highlight some properties concerning the stack used in Algorithm 3.

Fact 2. $S$ in Algorithm 3 satisfies the following:

1. If $(a_1, b_1, \text{depth}(a_1), \text{pos}(\bar{a}_1))$ is below $(a_2, b_2, \text{depth}(a_2), \text{pos}(\bar{a}_2))$ in $S$, then pos($\bar{a}_1$) $<$ pos($\bar{a}_2$),
depth($a_1$) $<$ depth($a_2$), and $a_2, b_2$ are in the subtree of $b_1$.

2. Consider $l$ at the end of the while loop in line 20. Then there are no stack elements $(\cdot, \cdot, l', \cdot)$ with
$l' > l$.

Figure 5 illustrates the relationship between two consecutive stack elements as discussed in Item 1 of
Fact 2.

[Figure 5: stack $S$ containing $(a_1, b_1, \text{pos}(\bar{a}_1), \text{depth}(a_1))$ on top of $(a_2, b_2, \text{pos}(\bar{a}_2), \text{depth}(a_2))$, next to the
corresponding part of the tree.]

Figure 5: $c$ is the current element under consideration in Algorithm 3. $a_1, b_1$ are in the subtree of $b_2$, compare
Item 1 of Fact 2.



Lemma 2. Algorithm 3 verifies all nodes $v$ whose left subtree is at least as large as its right subtree.

Proof. Let $q$ be such a node. Let $a_1, b_1$ be the children of $q$. Then it holds that

$$\text{pos}(\bar{a}_1) - \text{pos}(a_1) \ge \text{pos}(\bar{b}_1) - \text{pos}(b_1), \qquad (1)$$

since the size of the left subtree of $q$ is at least as large as the size of the right subtree.

Upon arrival of $\bar{a}_1$, Algorithm 3 pushes the 4-tuple $t = (a_1, b_1, \text{pos}(\bar{a}_1), \text{depth}(a_1))$ onto the stack $S$. We
have to show that $t$ remains on the stack until the arrival of $\bar{q}$. More precisely, we have to show that the
condition in line 18 is never satisfied for $s_2 = t$. Since the algorithm never deletes the bottom-most stack
item, we consider the case where there is a stack item $(a_2, b_2, \text{pos}(\bar{a}_2), \text{depth}(a_2))$ just below $t$. Item 1 of
Fact 2 tells us that $a_1, b_1$ are in the subtree of $b_2$. Let $c$ be the current tag under consideration such that
pos($b_1$) $\le$ pos($c$) $<$ pos($\bar{q}$). The situation is visualized in Figure 5.

According to the condition of line 18, $t$ gets removed from the stack if

$$\text{pos}(c) - \text{pos}(\bar{a}_1) > \text{pos}(\bar{a}_1) - \text{pos}(\bar{a}_2). \qquad (2)$$

Note that the left side of Inequality 2 is a lower bound on the size of the right subtree of $q$. Furthermore,
the right side of Inequality 2 upper bounds the size of the left subtree of $q$.

Using pos($c$) $-$ pos($\bar{a}_1$) $\le$ pos($\bar{b}_1$) $-$ pos($b_1$) $+ 1$ and pos($\bar{a}_1$) $-$ pos($\bar{a}_2$) $>$ pos($\bar{a}_1$) $-$ pos($a_1$), Inequality 2
contradicts Inequality 1, which shows that $t$ remains on the stack until the arrival of $\bar{q}$. Item 2 of Fact 2
guarantees that there is no other stack element on top of $t$ upon arrival of $\bar{q}$. This guarantees the verification
of node $q$ and proves the lemma. □

Theorem 2. Algorithm 2 is a bidirectional two-pass streaming algorithm for Validity(2) with space
$O(\log^2 N)$ and $O(\log N)$ processing time per letter.

Proof. To prove correctness of Algorithm 2, we ensure that all nodes get verified. By Lemma 2, in the first
pass, all nodes with a left subtree at least as large as the right subtree get verified. The second pass
then ensures verification of nodes with a right subtree that is at least as large as the left subtree.

Next, we prove by contradiction that for any current value of variable $n$ in Algorithm 3, the stack contains
at most $\log(n)$ elements. Assume that there is a stack configuration of size $t > \log(n) + 1$. Let $(n_1, n_2, \ldots, n_t)$
be the sequence of the fourth parameters of the stack elements. Since these elements are not yet removed,
due to line 18 of Algorithm 3 it holds that $n - n_i \le n_i - n_{i-1}$, or equivalently $n_i \ge \frac{1}{2}(n + n_{i-1})$, for all
$1 < i \le t$. Since $n_1 \ge 1$, we obtain that $n_i \ge \frac{2^{i-1} - 1}{2^{i-1}} n + \frac{1}{2^{i-1}}$ and, in particular, $n_{t-1} \ge (n - 1) + \frac{1}{n}$. Since all
$n_i$ are integers, it holds that $n_{t-1} \ge n$. Furthermore, since $n_t > n_{t-1}$, we obtain $n_t \ge n + 1$, which is a
contradiction, since the element at position $n + 1$ has not yet been seen.
Since $n \le 2N$ and the size of a stack element is in $O(\log n)$, Algorithm 3 uses space $O(\log^2 N)$. This
also implies that the while-loop at line 18 of Algorithm 3 can only be iterated $O(\log n)$ times during the
processing of a tag on the stream. The processing time per letter is then $O(\log N)$, since we assume that
operations on the stack run in constant time. □



4 Validity of general trees 
4.1 Preparation 

The FCNS encoding (see for instance [16]) is an encoding of unranked trees as extended 2-ranked trees, where
we distinguish left child from right child. This is an extension of ordered 2-ranked trees, since a node may
have a left child but no right child, and vice versa. We therefore duplicate the labels $a \in \Sigma$ to $a_L$ and $a_R$, for
respectively left and right opening/closing tags. The FCNS tree is obtained by keeping the same set of tree
nodes. The root node of the unranked tree remains the root in the FCNS tree, and we annotate it by default
left. The left child of any internal node in the FCNS tree is the first child of this node in the unranked tree
if it exists, otherwise it does not have a left child. The right child of a node in the FCNS tree is the next
sibling of this node in the unranked tree if it exists, otherwise it does not have a right child. For a tree $t$, we
denote by FCNS(t) the FCNS tree, and by XML(FCNS(t)) the XML sequence of the FCNS encoding of $t$.

Instead of annotating by left/right, another way to uniquely identify a node as left or right is to insert
dummy leaves with label $\bot$. For a tree $t$, we denote the binary version without annotations and with inserted
$\bot$ leaves by FCNS$^\bot$(t). The two representations can be easily transformed into each other. In this section, we
compute the FCNS encoding with annotations. In the next section, we present algorithms for the validation
of the encoded form that make use of the representation using dummy leaves. See Figure 6 for an example.
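
The two representations can be illustrated with a small Python sketch. It is not the streaming algorithm of the following subsections, only an in-memory illustration of the definition; the `Node` class, the function name `fcns_tags`, the tag spelling (`<a_L>`, `</a_L>`) and the dummy label `bot` are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def fcns_tags(siblings: List[Node], i: int = 0, side: str = "L",
              dummies: bool = False) -> List[str]:
    """Tags of the FCNS subtree rooted at siblings[i].

    dummies=False: annotated encoding (labels a_L / a_R).
    dummies=True:  unannotated encoding with dummy leaves labelled "bot".
    """
    v = siblings[i]
    has_left = bool(v.children)          # left child = first child in t
    has_right = i + 1 < len(siblings)    # right child = next sibling in t
    name = v.label if dummies else f"{v.label}_{side}"
    out = [f"<{name}>"]
    if has_left:
        out += fcns_tags(v.children, 0, "L", dummies)
    elif dummies and has_right:
        out += ["<bot>", "</bot>"]       # dummy leaf for the missing left child
    if has_right:
        out += fcns_tags(siblings, i + 1, "R", dummies)
    elif dummies and has_left:
        out += ["<bot>", "</bot>"]       # dummy leaf for the missing right child
    out.append(f"</{name}>")
    return out


# Hypothetical example tree: r(b(a, a, c), b, b(a, a), c)
t = Node("r", [Node("b", [Node("a"), Node("a"), Node("c")]), Node("b"),
               Node("b", [Node("a"), Node("a")]), Node("c")])
annotated = fcns_tags([t])
with_dummies = fcns_tags([t], dummies=True)
print(" ".join(annotated))
print(" ".join(with_dummies))
```

A dummy leaf is only inserted for a node with exactly one child in the FCNS tree, so leaves stay leaves and the left/right role of an only child remains recoverable without annotations.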




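Figure 6: Left: introductory example tree $t$ already shown in Figure 1. Middle: FCNS encoding of
$t$: XML(FCNS(t)) = $r_L\, b_L\, a_L\, a_R\, c_R\, \overline{c_R}\, \overline{a_R}\, \overline{a_L}\, b_R\, b_R\, a_L\, a_R\, \overline{a_R}\, \overline{a_L}\, c_R\, \overline{c_R}\, \overline{b_R}\, \overline{b_R}\, \overline{b_L}\, \overline{r_L}$. Right: FCNS$^\bot$ encoding of $t$:
XML(FCNS$^\bot$(t)) = $r\, b\, a\, \bot\overline{\bot}\, a\, \bot\overline{\bot}\, c\, \overline{c}\, \overline{a}\, \overline{a}\, b\, \bot\overline{\bot}\, b\, a\, \bot\overline{\bot}\, a\, \overline{a}\, \overline{a}\, c\, \overline{c}\, \overline{b}\, \overline{b}\, \overline{b}\, \bot\overline{\bot}\, \overline{r}$.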

In the following subsections we provide streaming algorithms for the transformation of XML(t) to
XML(FCNS(t)), that we call the FCNS encoding, and its inverse, the FCNS decoding.

The FCNS encoding can be seen as a reordering of the tags of XML(t) and an annotation of the tags
with left/right. We state several properties about the relationship of the ordering of the tags in XML(t)
and XML(FCNS(t)). Fact 3 concerns the structure of the subsequence of opening tags in XML(FCNS(t)),
Fact 4 concerns the structure of the subsequence of closing tags in XML(FCNS(t)), and Fact 5 concerns the
interplay of the subsequences of opening and closing tags in XML(FCNS(t)).

Fact 3. The opening tags in XML(t) are in the same order as the opening tags in XML(FCNS(t)).

For a node $v$ of some tree $t$, let $\mathrm{pos}'(v)$ and $\mathrm{pos}'(\overline{v})$ be the respective positions of the opening and closing
tags of $v$ in XML(FCNS(t)).

Fact 4. Nodes $v_1, v_2$ of $t$ satisfy $\mathrm{pos}'(\overline{v_1}) < \mathrm{pos}'(\overline{v_2})$ iff one of the following conditions holds:

1. $v_1$ is in the subtree of $v_2$ in $t$;

2. or $v_1$ is a right sibling of $v_2$ in $t$;

3. or there is a node $u$ with $\mathrm{depth}(u) < \mathrm{depth}(v_1) - 2$ such that $\mathrm{pos}(v_1) < \mathrm{pos}(u) < \mathrm{pos}(v_2)$.

Fact 5. Nodes $v_1, v_2$ of $t$ satisfy $\mathrm{pos}'(\overline{v_1}) < \mathrm{pos}'(v_2)$ iff there is a node $u$ with $\mathrm{depth}(u) < \mathrm{depth}(v_1) - 2$ such
that $\mathrm{pos}(v_1) < \mathrm{pos}(u) < \mathrm{pos}(v_2)$.



4.2 FCNS encoding 

In this section, we are interested in computing the transformation XML(t) → XML(FCNS(t)). Our strategy
is to compute the subsequence of opening tags of XML(FCNS(t)) (using Fact 3 and discussed in
subsection 4.2.1) and the subsequence of closing tags of XML(FCNS(t)) (using Fact 4 and discussed in
subsection 4.2.2) independently, and merge them afterwards (using Fact 5 and discussed in subsection 4.2.3).



4.2.1 Computing the sequence of opening tags 

Concerning the opening tags, since by Fact 3 the subsequences of opening tags in XML(t) and
XML(FCNS(t)) coincide, we extract the subsequence of opening tags of XML(t), and we annotate them
with left or right as they should be in XML(FCNS(t)). Recall that an opening tag is left if it is the opening
tag of a first child, otherwise it is right. Furthermore, for later use we annotate each opening tag $c$ with
depth(c) in $t$ and the position in the stream pos(c). We summarize this as a fact:

Fact 6. There is a streaming algorithm with space $O(\log N)$ that, given XML(t) as input, outputs on an
auxiliary stream the sequence of opening tags of XML(FCNS(t)) with left/right annotations, and furthermore
annotates each tag $c$ with depth(c) and pos(c), performing one pass on each stream.
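
A minimal Python sketch of such a one-pass annotation follows. It relies on the observation that an opening tag belongs to a first child exactly when the tag read just before it is an opening tag. The token format `(label, is_closing)`, the helper `parse`, and the convention that the root has depth 1 are assumptions of this sketch.

```python
def parse(text):
    """Turn 'r b /b /r' into [(label, is_closing), ...] (hypothetical input format)."""
    return [(tok.lstrip("/"), tok.startswith("/")) for tok in text.split()]


def annotate_opening_tags(xml_stream):
    """Yield (label, side, depth, pos) for every opening tag, in one pass.

    Only a depth counter, a position counter and one flag are kept in memory.
    """
    depth = 0
    prev_was_opening = True          # makes the root "left" by default
    for pos, (label, is_closing) in enumerate(xml_stream, start=1):
        if is_closing:
            depth -= 1
        else:
            depth += 1
            side = "L" if prev_was_opening else "R"
            yield (label, side, depth, pos)
        prev_was_opening = not is_closing


# XML(t) for the hypothetical tree r(b(a, a, c), b, b(a, a), c)
xml_t = parse("r b a /a a /a c /c /b b /b b a /a a /a /b c /c /r")
annotated = list(annotate_opening_tags(xml_t))
print(annotated)
```

The counters hold values up to the stream length and the depth, which is where the logarithmic number of bits comes from.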



4.2.2 Computing the sequence of closing tags 

For computing the sequence of closing tags, we start with the sequence of opening tags of XML(t) as
produced by the output of the algorithm of Fact 6, that is, correctly annotated with left/right and with
depth and position annotations. To obtain the correct subsequence of closing tags as in XML(FCNS(t)), we
interpret the opening tags as closing tags and we sort them with a merge sort algorithm. Merge sort can be
implemented as a streaming algorithm with $O(\log N)$ passes and 3 auxiliary streams [5]. For the sake of
simplicity, Algorithm 4 assumes an input of length $2^l$ for some $l > 0$.



Algorithm 4 Merge sort

Require: unsorted data of length $2^l$ on stream 1
1: for $i = 0 \dots l - 1$ do
2:    copy data in blocks of length $2^i$ from stream 1 alternately onto stream 2 and stream 3
3:    for $j = 1 \dots 2^{l-i-1}$ do merge($2^i$) end for
4: end for



merge($b$) reads simultaneously the next $b$ values from stream 2 and stream 3, and merges them onto
stream 1. The whole loop in Line 3 of Algorithm 4 requires one read pass on stream 2, one read pass on
stream 3, and one write pass on stream 1. See Figure 7 for an illustration.
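
The following Python sketch simulates Algorithm 4 with lists playing the role of streams. It is only an illustration under simplifying assumptions: streams are Python lists read through iterators, values are compared with `<=` rather than with the tag comparator developed below, and the function names are made up for this example.

```python
from itertools import islice


def merge(it2, it3, b, out):
    """Merge the next b values of stream 2 and stream 3 onto out.

    Only the two current heads and two counters are kept in memory.
    """
    x, y = next(it2), next(it3)
    i = j = 1                                # values consumed from each block
    while True:
        if x <= y:
            out.append(x)
            if i == b:                       # block of stream 2 exhausted
                out.append(y)
                out.extend(islice(it3, b - j))
                return
            x, i = next(it2), i + 1
        else:
            out.append(y)
            if j == b:                       # block of stream 3 exhausted
                out.append(x)
                out.extend(islice(it2, b - i))
                return
            y, j = next(it3), j + 1


def stream_merge_sort(data):
    """Return (sorted data, number of rounds); len(data) must be a power of two."""
    stream1 = list(data)
    n = len(stream1)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    b, rounds = 1, 0
    while b < n:
        # Line 2: copy blocks of length b alternately onto streams 2 and 3.
        stream2, stream3 = [], []
        src = iter(stream1)
        for k in range(n // b):
            (stream2 if k % 2 == 0 else stream3).extend(islice(src, b))
        # Line 3: merge pairs of blocks back onto stream 1.
        stream1 = []
        it2, it3 = iter(stream2), iter(stream3)
        for _ in range(n // (2 * b)):
            merge(it2, it3, b, stream1)
        b, rounds = 2 * b, rounds + 1
    return stream1, rounds


result, rounds = stream_merge_sort([5, 2, 7, 1, 8, 3, 6, 4])
print(result, rounds)
```

Each round reads and writes every stream once, and the number of rounds is the base-2 logarithm of the input length.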



Figure 7 (stream contents after each step):
line 2 (copy): stream 1: $B_1\, B_2\, B_3\, B_4 \cdots$; stream 2: $B_1\, B_3 \cdots$; stream 3: $B_2\, B_4 \cdots$
line 3 (merge): stream 1: $B_{(1,2)}\, B_{(3,4)} \cdots$; stream 2: $B_1\, B_3 \cdots$; stream 3: $B_2\, B_4 \cdots$

Figure 7: In Line 2, blocks from stream 1 are copied onto stream 2 and stream 3. The $B_i$ are sorted blocks.
In Line 3, all blocks $B_i$ and $B_{i+1}$ are merged into a sorted block $B_{(i,i+1)}$.

In order to use merge sort, we have to define a comparator function that, given two closing tags $\overline{c_1}, \overline{c_2}$,
decides whether $\mathrm{pos}'(\overline{c_1}) < \mathrm{pos}'(\overline{c_2})$. Firstly, consider nodes $v_1, v_2$ with $\mathrm{pos}(v_1) < \mathrm{pos}(v_2)$ to be as in Point 1
or Point 2 of Fact 4, that is, either $v_1, v_2$ are siblings or one node is contained in the subtree of the other
one. Evidently, their ordering with respect to pos' can easily be decided by their depth: $\mathrm{pos}'(\overline{v_1}) < \mathrm{pos}'(\overline{v_2})$
iff $\mathrm{depth}(v_1) > \mathrm{depth}(v_2)$.

If neither $v_1, v_2$ are siblings, nor $v_2$ is in the subtree of $v_1$ (Point 3 of Fact 4), then $\mathrm{pos}'(\overline{v_1}) < \mathrm{pos}'(\overline{v_2})$,
independently of their depths. A comparison function should hence be able to infer the relationship of the
two nodes; however, this seems to be difficult in the streaming model.

To overcome this problem, instead of defining a comparison function, we design a complete merge function
in Lemma 3 that, by construction, only compares two nodes of the first kind. The key idea is to introduce
separator tags, which use a new tag outside $\Sigma$. They are initially inserted right after each closing
tag of a last child $u$, that is, exactly before the depth decreases. We denote by $\hat{u}$ the separator we introduce
when seeing the last child $u$, and we define $\mathrm{depth}(\hat{u}) = \mathrm{depth}(u)$.

Fact 7. There is a streaming algorithm with space $O(\log N)$ that, given a sequence XML(t) on a stream,
computes on another stream the sequence of opening tags of XML(FCNS(t)) together with their separators, and
annotated with depth, pos and left/right, performing one pass on each stream.
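
One way to place the separators in a single pass is sketched below in Python. It uses the fact that a closing tag belongs to a last child exactly when the next tag is again a closing tag, so the separator can be emitted as soon as two closing tags follow each other. The token format, the helper `parse`, the tuple layout of the output and the root depth of 1 are assumptions of this sketch.

```python
def parse(text):
    """Turn 'r b /b /r' into [(label, is_closing), ...] (hypothetical input format)."""
    return [(tok.lstrip("/"), tok.startswith("/")) for tok in text.split()]


def opening_tags_with_separators(xml_stream):
    """Yield ("open", label, side, depth, pos) and ("sep", label, depth) items."""
    depth = 0
    prev = None                      # (label, is_closing, depth) of the previous tag
    for pos, (label, is_closing) in enumerate(xml_stream, start=1):
        if is_closing:
            if prev is not None and prev[1]:
                # Two closing tags in a row: the previous one closed a last child u.
                yield ("sep", prev[0], prev[2])      # separator with depth(u)
            prev = (label, True, depth)
            depth -= 1
        else:
            depth += 1
            side = "R" if (prev is not None and prev[1]) else "L"
            yield ("open", label, side, depth, pos)
            prev = (label, False, depth)


# XML(t) for the hypothetical tree r(b(a, a, c), b, b(a, a), c)
xml_t = parse("r b a /a a /a c /c /b b /b b a /a a /a /b c /c /r")
items = list(opening_tags_with_separators(xml_t))
separators = [item for item in items if item[0] == "sep"]
print(separators)
```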

We have to define the way we integrate the separators into our sorting. Let $v_1, v_2, \dots, v_k$ be the ordered
sequence of the children of some node. For the separator $\hat{v}_k$ we ask its position among the closing tags to
satisfy for each node $v$:

$\mathrm{pos}'(\overline{v}) < \mathrm{pos}'(\hat{v}_k)$ iff $\mathrm{pos}'(\overline{v}) < \mathrm{pos}'(\overline{v_1})$; (3)

and for any other separator $\hat{w}_k$:

$\mathrm{pos}'(\hat{v}_k) < \mathrm{pos}'(\hat{w}_k)$ iff $\mathrm{pos}'(\overline{v_k}) < \mathrm{pos}'(\overline{w_k})$. (4)

Blocks appearing in merge sort fulfill a property that we call well-sorted. A block $B$ of closing tags is
well-sorted if the corresponding tags in XML(FCNS(t)) appear in the same order, and for all $\overline{v_1}, \overline{v_2} \in B$ with
$\mathrm{pos}(v_1) < \mathrm{pos}(v_2)$, all closing tags $\overline{v}$ of nodes $v$ with $\mathrm{pos}(v_1) < \mathrm{pos}(v) < \mathrm{pos}(v_2)$ are in $B$ as well.

In addition, for two blocks $B_1, B_2$ of closing tags, we say that $(B_1, B_2)$ is a well-sorted adjacent pair if
$B_1$ and $B_2$ are well-sorted, for each closing tag $\overline{v_1} \in B_1$ and each closing tag $\overline{v_2} \in B_2$, $\mathrm{pos}(v_1) < \mathrm{pos}(v_2)$ is
satisfied, and furthermore, all closing tags $\overline{v}$ of nodes $v$ with $\mathrm{pos}(v_1) < \mathrm{pos}(v) < \mathrm{pos}(v_2)$ are either in $B_1$ or
$B_2$.

The only function to design is a comparator deciding for two closing tags $v_1, v_2$ from a well-sorted adjacent
pair $(B_1, B_2)$ whether $\mathrm{pos}'(v_1) < \mathrm{pos}'(v_2)$.

The following lemma shows that we can merge a well-sorted adjacent pair correctly.

Lemma 3. Let $(B_1, B_2)$ be a well-sorted adjacent pair, and let $v_1 = B_1[p_1]$ and $v_2 = B_2[p_2]$ for some $p_1, p_2$.
Assume that $\mathrm{pos}'(v) < \mathrm{pos}'(v_1)$ and $\mathrm{pos}'(v) < \mathrm{pos}'(v_2)$, for all $v \in B_1[1, p_1 - 1] \cup B_2[1, p_2 - 1]$. Then:

1. If $v_1$ is a separator, or there is a separator in $B_1$ after $v_1$, then $\mathrm{pos}'(v_1) < \mathrm{pos}'(v_2)$;

2. Else if $v_2$ is a separator then:

(a) if $\mathrm{depth}(v_1) < \mathrm{depth}(v_2)$ then $\mathrm{pos}'(v_1) < \mathrm{pos}'(v_2)$,

(b) else $\mathrm{pos}'(v_1) > \mathrm{pos}'(v_2)$;

3. Else (neither $v_1$ nor $v_2$ is a separator):

(a) if $\mathrm{depth}(v_1) < \mathrm{depth}(v_2)$ then $\mathrm{pos}'(v_1) < \mathrm{pos}'(v_2)$;

(b) else $\mathrm{pos}'(v_1) > \mathrm{pos}'(v_2)$.

Proof. Let $(B_1, B_2)$ be a well-sorted adjacent pair. Let $l = \max\{i : B_1[i] \text{ is a separator}\}$. If there are no
separators in $B_1$, let $l = 0$. First, we prove Point 1. Since $B_1$ is well-sorted, we only need to check
that $\mathrm{pos}'(B_1[l]) < \mathrm{pos}'(B_2[1])$. Denote by $u$ the last child that was responsible for the insertion of the
separator tag $B_1[l]$. Let $u'$ be the left-most sibling of $u$. Due to Equation (3), it suffices to show that
$\mathrm{pos}'(\overline{u'}) < \mathrm{pos}'(B_2[1])$. Clearly, the shortest path from $u'$ to $B_2[1]$ passes by a common ancestor $p$ of $u'$ and
$B_2[1]$ which is not the parent of $u'$, since the separator $B_1[l]$ indicates that the last child $u$ has been seen.
Then, by the third condition of Fact 4, we get $\mathrm{pos}'(\overline{u'}) < \mathrm{pos}'(B_2[1])$.

For proving Points 2 and 3 we use the observation that if the premises to Point 1 are not fulfilled, $v_1, v_2$
do not have a common ancestor $p$ such that $\mathrm{pos}(v_1) < \mathrm{pos}(p) < \mathrm{pos}(v_2)$ and $p$ is not the parent node of $v_1$.
Furthermore, this observation implies that $\mathrm{depth}(v_2) > \mathrm{depth}(v_1) - 1$ and hence, if $\mathrm{depth}(v_2) > \mathrm{depth}(v_1)$
then $v_2$ is in the subtree of $v_1$. This and Fact 4 prove Points 2a, 2b, 3a and 3b.

We prove the observation by contradiction. Assume that there is such a node $p$. Since $(B_1, B_2)$ is a
well-sorted adjacent pair and $\mathrm{pos}(v_1) < \mathrm{pos}(p) < \mathrm{pos}(v_2)$, node $p$ would be in $B_1 \cup B_2$. Therefore, the
separator $\hat{u}$ inserted after the rightmost sibling of $v_1$ would be in $B_1 \cup B_2$ as well. More precisely, this
separator would be in $B_2[1 \dots p_2 - 1]$ since otherwise Point 1 would have been applied. This, however, is a
contradiction to the assumption that $\mathrm{pos}'(v) < \mathrm{pos}'(v_1)$ for all $v \in B_1[1 \dots p_1 - 1] \cup B_2[1 \dots p_2 - 1]$, since it holds
that $\mathrm{pos}'(v_1) < \mathrm{pos}'(\hat{u})$. Hence such a node does not exist. □

Lemma 4. There is an $O(\log N)$-pass streaming algorithm with space $O(\log N)$ and 3 auxiliary streams that
computes the subsequence of closing tags of the FCNS encoding of any XML document given in the input
stream.

Proof. Using Fact 7, we compute on the first auxiliary stream the sequence of opening tags with corresponding
annotations, together with separators, and interpret opening tags as closing tags.

We show that we can do a merge sort algorithm with a merge function inspired by Lemma 3 on the first
three auxiliary streams with $O(\log N)$ space and passes. For that, assume that the first stream contains a
sequence $(B_1, B_2, \dots, B_M)$ of blocks of size $2^i$. For simplicity we assume that $M$ is even, otherwise we add
an empty block. We alternately copy odd blocks on the second stream, and even blocks on the third stream.
Before each block $B_{2k}$ that we write on the third stream, we write the number of separators
that occur in the block $B_{2k-1}$ that was copied on the second stream.

Then we merge sequentially all pairs of blocks $(B_{2k-1}, B_{2k})$ for $1 \le k \le M/2$ using Lemma 3. Note that
the $(B_{2k-1}, B_{2k})$ are all well-sorted adjacent pairs. Let $l = \max\{i : B_{2k-1}[i] \text{ is a separator}\}$. Firstly, we copy elements
$B_{2k-1}[1, l]$ onto auxiliary stream 1. Knowing the number of separators in $B_{2k-1}$ allows us to perform
this operation. The correctness of this step follows from Point 1 of Lemma 3. Then, we merge blocks
$B_{2k-1}[l + 1, 2^i]$ and $B_{2k}$ by using the comparison function defined in Points 2 and 3 of Lemma 3. □

4.2.3 Merging opening and closing tags 

Merging the subsequence of opening tags of XML(FCNS(t)) and the subsequence of closing tags of
XML(FCNS(t)) can be done using one additional pass.

Lemma 5. There is a streaming algorithm with space $O(\log N)$ that, given the sequence of opening tags
of XML(FCNS(t)) on a stream, and the sequence of closing tags of XML(FCNS(t)) on another stream,
computes XML(FCNS(t)) on a third stream using one pass on each stream.

Proof. We directly apply Fact 5, so that we know when to alternate from the sequence of opening tags to
the one of closing tags, and conversely. □

From Fact 6, Lemma 4 and Lemma 5 we obtain Theorem 3.

Theorem 3. There is an $O(\log N)$-pass streaming algorithm with space $O(\log N)$ and 3 auxiliary streams and
$O(1)$ processing time per letter that computes on the third auxiliary stream the FCNS transformation of any
XML document given in the input stream.

Proof. Firstly, we compute according to Lemma 4 the sequence of closing tags and we store them on auxiliary
stream 1. Then, by Fact 6 we extract the sequence of opening tags, and we store them on auxiliary stream
2. By Lemma 5 we can merge the tags of auxiliary stream 1 and auxiliary stream 2 correctly onto stream 3.

The space requirements of these operations do not exceed $O(\log N)$. The processing time per letter of
these operations is constant. □

4.3 Checking validity on the encoded form

In this section, we reuse the algorithms for validating binary trees for the validation of the encoded form.
We discuss one-pass read/write streaming algorithms (one for left-to-right passes, and one for right-to-left
passes) that read XML(FCNS$^\bot$(t)) and output an XML document with annotations on closing tags that can
be fed into Algorithm 1 or Algorithm 2. This requires small modifications in Algorithm 1 and Algorithm 2,
since validity then depends on the annotations. Since we want to reuse the algorithms for the validation of
binary trees, we suppose that XML(FCNS$^\bot$(t)) is available as input. The algorithm stated in Theorem 3
can be easily adapted such that it outputs XML(FCNS$^\bot$(t)) instead of XML(FCNS(t)).

The problem of validating the encoded form and the problem of validating binary trees are similar.
Note that the closing tags of the children $v_1, \dots, v_k$ of a node $v$ form a substring in XML(FCNS$^\bot$(t)), see Figure 8. Hence, for
validating a node $v$, the label $v$ has to be related to the block of children nodes $\overline{v_k} \dots \overline{v_1}$. This is similar to
the task of the validation of binary trees, where the parent label $v$ has to be related to the block of children
nodes $\overline{v_1}\,\overline{v_2}$.



Figure 8: A tree $t$ and its FCNS$^\bot$ encoding. While the opening and closing tags of the children of a node
$v$ are separated by the subtrees $t_1, \dots, t_k$ in XML(t), the closing tags of the children of $v$ are consecutive in
XML(FCNS$^\bot$(t)) in reverse order, that is $\overline{v_k}\,\overline{v_{k-1}} \cdots \overline{v_2}\,\overline{v_1}$ is a substring of XML(FCNS$^\bot$(t)).

For a node $v$, we gather the information of the children nodes $v_1, \dots, v_k$ with the help of finite automata
$A_1$ (for left-to-right passes) and $A_2$ (for right-to-left passes) that we define later, and the information, a
state of the automata, is annotated at the closing tag of leaf $v_1$ (left-to-right) or $v_k$ (right-to-left). Then,
with the help of Algorithm 1 or Algorithm 2, this information is related to the parent label $v$.

We define automata $A_1$ and $A_2$ now. $A_1$ is constructed from automaton $A$. Let $A = (\Sigma, Q, q_0, \delta, F)$ be
a deterministic finite automaton where $\Sigma$ is its input alphabet, $Q$ is the state set, $q_0$ is its initial state,
$\delta : Q \times \Sigma \to Q$ is the transition function, and $F$ is a set of final states. For $a \in \Sigma$ and the input DTD
$D$, denote by $A_a$ a deterministic finite automaton that accepts the regular expression $D(a)$. We compose
the $A_a$ as in the left illustration of Figure 9 to an automaton $A$ that accepts words $\omega'$ such that $\omega' = a\omega$,
$a \in \Sigma$, $\omega \in \Sigma^*$ if $\omega \in D(a)$.



Figure 9: Left: Automaton $A$. $A_1$ accepts words $\omega$ iff $A$ accepts $\omega^{rev}$. Right: Automaton $A_2$ is a version of
the illustrated automaton without $\varepsilon$ transitions.

Let $A_1 = (\Sigma, Q_1, (q_0)_1, \delta_1, F_1)$ be a deterministic finite automaton that accepts a word $\omega$ iff $\omega^{rev}$ is
accepted by $A$, where $\omega^{rev}$ denotes $\omega$ read from right to left.

Let $A_2 = (\Sigma, Q_2, (q_0)_2, \delta_2, F_2)$ be a deterministic finite automaton that accepts a word $\omega'$ such that
$\omega' = \omega a$, $a \in \Sigma$, $\omega \in \Sigma^*$ if $\omega \in D(a)$. $A_2$ is a version of the automaton in the right illustration of Figure 9
without $\varepsilon$ transitions.

In the following, we assume for every state $q_1 \in Q_1$ and $q_2 \in Q_2$ that $\delta_1(q_1, \bot) = q_1$ and $\delta_2(q_2, \bot) = q_2$.
Given XML(FCNS$^\bot$(t)), for a left-to-right pass, we annotate closing tags $\overline{v}$ by a state from the state set
$Q_1$ of automaton $A_1$. We denote the annotation for left-to-right passes of $\overline{v}$ by $\mathrm{ann}_1(\overline{v})$.

if $v$ is a leaf then $\mathrm{ann}_1(\overline{v}) = \delta_1((q_0)_1, v)$,

otherwise let $u$ be the right child of $v$, then $\mathrm{ann}_1(\overline{v}) = \delta_1(\mathrm{ann}_1(\overline{u}), v)$.

For a right-to-left pass, we annotate closing tags $\overline{v}$ by a state from the state set $Q_2$ of automaton $A_2$. We
denote the annotation of right-to-left passes of $\overline{v}$ by $\mathrm{ann}_2(\overline{v})$.

if $v$ is a left child then $\mathrm{ann}_2(\overline{v}) = \delta_2((q_0)_2, v)$,

if $v$ is a right child of $u$ then $\mathrm{ann}_2(\overline{v}) = \delta_2(\mathrm{ann}_2(\overline{u}), v)$.

For the sake of completeness, the root can be annotated by $\mathrm{ann}_2(\overline{r}) = (q_0)_2$, though this annotation will
not be used for checking validity. Figure 10 shows the annotations of the children nodes $v_1, \dots, v_k$ of node
$v$.



Figure 10: Left: a node $v$ with its children $v_1, \dots, v_k$ in the FCNS tree. Middle: annotations for left-to-right
passes. $v$ is valid if $\delta_1(\mathrm{ann}_1(\overline{v_1}), v)$ results in an accepting state of $A_1$. Right: annotations for right-to-left
passes. $v$ is valid if $\delta_2(\mathrm{ann}_2(\overline{v_k}), v)$ is an accepting state of $A_2$.

The annotation operations can be seen as streaming algorithms performing one read pass over the input
and one write pass over another stream using constant space, since the annotation of a closing tag $\overline{v}$ only
depends on the annotation of its right child (for left-to-right passes) or its parent (for right-to-left passes).
The respective closing tag is in both cases the tag prior to $\overline{v}$ in the input stream XML(FCNS$^\bot$(t)).
Remember that we consider a right-to-left pass for the annotation with $\mathrm{ann}_2$.

In the following, we prove that, given the annotations, Algorithm 1 and Algorithm 2 can be adapted to
decide validity of the encoded form.

Theorem 4. There is a one-pass deterministic algorithm for VALIDITY with space $O(\sqrt{N \log N})$ and $O(1)$
processing time per letter when the input is given in its FCNS$^\bot$ encoding.

Proof. Append the rule $D(\bot) = \varepsilon$ to the input DTD $D$. Compute automaton $A_1$. Compute a new XML
stream with annotations $\mathrm{ann}_1$ and feed this stream directly into Algorithm 1. In order to verify a node $v$,
Algorithm 1 uses the closing tag $\overline{v_1}$ of the left child $v_1$ of $v$. Since $v$ is valid if $\delta_1(\mathrm{ann}_1(\overline{v_1}), v)$ is an accepting
state, we only have to replace the check function used in Algorithm 1. The new check function computes
$\delta_1(\mathrm{ann}_1(\overline{v_1}), v)$ and aborts if the resulting state is not accepting.

The space requirements and the processing time per letter are inherited from Algorithm 1. □

Theorem 5. There is a bidirectional two-pass deterministic algorithm for VALIDITY with space $O(\log^2 N)$
and $O(\log N)$ processing time per letter when the input is given in its FCNS$^\bot$ encoding.

Proof. Append the rule $D(\bot) = \varepsilon$ to the input DTD $D$. Compute automata $A_1$ and $A_2$. Firstly, in a
left-to-right pass, as in the proof of Theorem 4, we compute a new XML stream with annotations $\mathrm{ann}_1$ and
feed this stream directly into Algorithm 3. We adapt the check function as above.

Concerning the right-to-left pass, we compute the annotations $\mathrm{ann}_2$, and feed this stream directly into
Algorithm 3, interpreting opening tags as closing tags, and vice versa. Note that the annotations $\mathrm{ann}_2$ are
hence annotated on opening tags. Let $v$ be a node with children $v_1, \dots, v_k$. We have to show how Algorithm 3
can be adapted to relate the annotation of the opening tag $v_k$ to $v$. When Algorithm 3 reads the closing
tag $\overline{w}$ and the opening tag $v_1$, it pushes $(w, v_1, \cdot, \cdot)$ on the stack, where $w$ is the sibling of $v_1$ in the FCNS$^\bot$
encoding. Note that the subsequent tags on the stream are $v_2, v_3, \dots, v_k$. Node $v_k$ can be identified since $v_k$
is either a leaf or followed by a tag with label $\bot$. Hence, the stack item $(w, v_1, \cdot, \cdot)$ can be annotated with
$\mathrm{ann}_2(v_k)$ when $v_k$ is seen. Again by adapting the check routine, we can compute $\delta_2(\mathrm{ann}_2(v_k), v)$ and abort
if the result is not an accepting state.

The space requirements and the processing time per letter are inherited from Algorithm 2. □

Applying the bidirectional algorithm of Theorem 5 on the encoded form XML(FCNS$^\bot$(t)), we obtain that
validity of general trees can be decided memory efficiently in the streaming model with auxiliary streams.

Corollary 1. There is a bidirectional $O(\log N)$-pass deterministic streaming algorithm for VALIDITY with
space $O(\log^2 N)$, $O(\log N)$ processing time per letter, and 3 auxiliary streams.



4.4 Decoding 

In the following, we present a streaming algorithm for FCNS decoding, that is, given XML(FCNS(t)) of
some tree $t$, output XML(t). We start with a non-streaming algorithm, Algorithm 5, performing this task.



Algorithm 5 offline algorithm for FCNS decoding

1: for $i = 1 \to 2N$ do
2:    if $X[i]$ is an opening tag then
3:       write $X[i]$
4:       if $X[i]$ does not have a left subtree then
5:          write $\overline{X[i]}$
6:       end if
7:    else if $X[i]$ is a left closing tag then {See Figure 11}
8:       let $p$ be the parent node of $X[i]$
9:       write $\overline{p}$
10:   end if
11: end for




Figure 11: The main difficulty of the FCNS decoding is to write the closing tag of a node $p$ when the closing
tag of its left child is seen. This is difficult when the subtrees of $v_1$ and $v_2$ are large.
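
The offline algorithm can be sketched in Python as follows. The sketch keeps a stack of all currently open FCNS nodes, which is exactly the memory that a streaming algorithm cannot afford; the token format `(label, side, is_closing)`, the helper `parse_fcns` and the function name `fcns_decode` are assumptions of this example.

```python
def parse_fcns(text):
    """Turn 'r_L a_L /a_L /r_L' into [(label, side, is_closing), ...]."""
    tags = []
    for tok in text.split():
        label, side = tok.lstrip("/").split("_")
        tags.append((label, side, tok.startswith("/")))
    return tags


def fcns_decode(X):
    """Offline FCNS decoding: returns XML(t) as a list of (label, is_closing)."""
    out = []
    stack = []                        # labels of the open nodes of the FCNS tree
    for i, (label, side, is_closing) in enumerate(X):
        if not is_closing:
            stack.append(label)
            out.append((label, False))
            nxt = X[i + 1]            # exists: at the latest, the own closing tag
            has_left_subtree = (not nxt[2]) and nxt[1] == "L"
            if not has_left_subtree:
                out.append((label, True))       # the node is a leaf of t
        else:
            stack.pop()
            if side == "L" and stack:
                # Closing tag of a first child: its FCNS parent is its parent in t.
                out.append((stack[-1], True))
    return out


fcns = parse_fcns("r_L b_L a_L a_R c_R /c_R /a_R /a_L b_R b_R a_L a_R /a_R /a_L "
                  "c_R /c_R /b_R /b_R /b_L /r_L")
decoded = " ".join(("/" if closing else "") + label
                   for label, closing in fcns_decode(fcns))
print(decoded)
```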

We describe how this algorithm can be converted into a streaming algorithm. For some opening tag $X[i]$,
checking the condition in Line 4 can easily be done by inspecting $X[i + 1]$: if $X[i + 1]$ is a right opening tag
or equals $\overline{X[i]}$, then $X[i]$ does not have a left subtree. The difficulty in converting this algorithm into a streaming
algorithm lies in Line 8: it is difficult to keep track of opening tags until the respective closing tags of their
left children are seen, and indeed, this cannot be done with sublinear space in one pass (Fact 10).

In the following, we present a streaming algorithm that performs one pass over the input, but two passes
over the output, and uses $O(\sqrt{N \log N})$ space, and a streaming algorithm that performs $O(\log N)$ passes
over the input and 3 auxiliary streams using $O(\log^2 N)$ space.

4.4.1 One read-pass and two write-passes

We read blocks of size $\sqrt{N \log N}$ and execute Algorithm 5 on each block. In Lemma 6 we show that in any
block there is at most one left closing tag for which the parent's opening and closing tags are not in that
block. Hence per block there is at most one left closing tag for which we cannot obtain the label of the
parent node. We call this closing tag critical. In this case we write a dummy symbol on the output stream
that will be overwritten by the parent closing tag in the second pass. The closing tag of the parent node will
arrive in a subsequent block, and it can easily be identified since it is the next closing tag arriving
at depth one less than that of the critical closing tag. We store it upon its arrival in our random access memory. Since
there is at most one critical closing tag per block and we have a block size of $\sqrt{N \log N}$, we have to recover
at most $O(\sqrt{N / \log N})$ parent nodes. At the end of the pass over the input stream we have recovered all
closing tags of parent nodes for which we wrote dummy symbols on the output stream. In a second pass
over the output stream we overwrite the dummy symbols by the correct closing tags.
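
The space bound can be checked with a short count. It assumes, as a simplification not spelled out in the text, that the stream has $2N$ tags and that each recovered parent closing tag is stored with $O(\log N)$ bits:

```latex
\[
  \#\text{blocks} \;=\; \frac{2N}{\sqrt{N \log N}} \;=\; 2\sqrt{\frac{N}{\log N}},
  \qquad
  \underbrace{2\sqrt{\frac{N}{\log N}}}_{\text{stored parent tags}} \cdot\; O(\log N)
  \;=\; O\!\left(\sqrt{N \log N}\right).
\]
```

Under these assumptions the stored parent tags need no more space than one block of size $\sqrt{N \log N}$, so the block size balances the two contributions.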

The complexity derives from the following lemma demonstrating that in a block there is at most one
critical left closing tag.

Lemma 6. Let $X[i, j]$ be a block. Then there is at most one left closing tag $\overline{a}$ with parent node $p$ such that:

$\mathrm{pos}(p) < i \le \mathrm{pos}(\overline{a}) \le j < \mathrm{pos}(\overline{p})$. (5)

Proof. By contradiction, assume that there are 2 left closing tags $\overline{a}, \overline{b}$ with $p$ being the parent node of $a$,
and $q$ being the parent node of $b$, for which Inequality (5) holds. Wlog we assume that $\mathrm{pos}(p) < \mathrm{pos}(q)$.
Since $\mathrm{pos}(p) < \mathrm{pos}(q) < \mathrm{pos}(\overline{a})$, $q$ is contained in the subtree of $a$ or $q = a$. This, however, implies that
$\mathrm{pos}(\overline{q}) \le \mathrm{pos}(\overline{a}) \le j$, contradicting $\mathrm{pos}(\overline{q}) > j$. □

Theorem 6. There is a streaming algorithm using $O(\sqrt{N \log N})$ space and $O(1)$ processing time per letter
which performs one pass over the input stream containing XML(FCNS(t)) and two passes over the output stream
onto which it outputs XML(t).



4.4.2 Logarithmic number of passes 

Again, we use the offline Algorithm 5 as a starting point for the algorithm we design now. To cope with
the problem that it is hard to remember all opening parent tags when their corresponding closing tag ought
to be written on the output, we categorically write dummy symbols on the output stream for all parent
closing tags. The crux then is the following observation:

Fact 8. Let $\overline{c_1} \dots \overline{c_N}$ be the subsequence of closing tags of left children of XML(FCNS(t)). Then the
sequence $\overline{p_1} \dots \overline{p_N}$ is a subsequence of XML(t), where $p_i$ is the parent node of $c_i$ in FCNS(t).

We apply a modified version of our bidirectional two-pass Algorithm 2 to recover the missing tags. Instead
of checking validity, once the check function is called in Algorithm 3 with variables $(a, b, c)$, we output the
parent label $a$ onto an auxiliary stream, annotated with $\mathrm{pos}(\overline{b})$. We do the same in a reverse pass over the
input stream, counting positions from $2N$ downwards to 1. In so doing, the auxiliary stream contains all
parent labels for which dummy symbols are written on the output stream.

Fact 8 shows that it is enough to sort, by means of two further auxiliary streams, the auxiliary stream
with respect to the annotated position of the closing tags of the left children of these nodes. In a last pass
we insert the parent closing tags into the output stream.

Theorem 7. There is an $O(\log N)$-pass streaming algorithm with space $O(\log^2 N)$ and $O(\log N)$ processing
time per letter and 3 auxiliary streams that computes on the third auxiliary stream the FCNS decoding of
any FCNS encoded document given in the input stream.

Figure 12: Left: hard instance. Right: its FCNS encoded form.



5 Lower bounds for FCNS encoding and decoding 

We define a family of hard instances of length $N = \Theta(n)$ for the computation of the FCNS encoding of a
tree as in Figure 12.

It is easy to see that computing the sequence of closing tags in the FCNS encoding requires inverting
a stream. Let $t$ be a hard instance. Then XML(t) = $r\, x_1 \overline{x_1}\, x_2 \overline{x_2} \cdots x_n \overline{x_n}\, \overline{r}$, and XML(FCNS(t)) =
$r_L\, x_{1L}\, x_{2R} \cdots x_{nR}\, \overline{x_{nR}}\, \overline{x_{(n-1)R}} \cdots \overline{x_{2R}}\, \overline{x_{1L}}\, \overline{r_L}$. Since the writing on the stream can only start after reading $x_n$,
we deduce that memory space $\Omega(n)$ is required, in order to store all previous tags.

Fact 9. Every one-pass randomized streaming algorithm for FCNS encoding with bounded error requires
$\Omega(N)$ space.

We conjecture that this argument can be extended as follows:

Conjecture 1. Any $p$-pass randomized streaming algorithm for FCNS encoding with bounded error requires
space $\Omega(N/p)$.

We now define another family of hard instances of length $N = \Theta(n)$ for decoding an FCNS encoded tree
as in Figure 13.



Figure 13: Left: hard instance in FCNS form, where $y$ is any tree of size $\Theta(n)$. Right: its decoded form.

Intuitively, decoding the tree of any hard instance requires putting the full tree $y$ into
memory. Let XML(FCNS(t)) denote a hard instance which we aim to decode into XML(t).
Then: XML(FCNS(t)) = $r_L\, x_{1L} \cdots x_{nL}\, \overline{x_{nL}} \cdots \overline{x_{(k+1)L}}\, Y\, \overline{x_{kL}} \cdots \overline{x_{1L}}\, \overline{r_L}$ and the decoded form is
XML(t) = $r\, x_1 \dots x_n\, \overline{x_n}\, \overline{x_{n-1}} \cdots \overline{x_k}\, Z\, \overline{x_{k-1}} \cdots \overline{x_1}\, \overline{r}$, where $Z$ is the decoded form of $Y$. Since $Z$ can only be written
after $\overline{x_k}$, and since $x_k$ cannot be memorized because $k$ was unknown until we reach $Y$, memory space $\Omega(n)$
is required. This argument can be easily formalized using standard information theory arguments.

Fact 10. Every one-pass randomized streaming algorithm for FCNS decoding with bounded error requires
space $\Omega(N)$.

This argument can be extended to two-pass randomized streaming algorithms. Construct a hard instance
of size $\Theta(n^2)$ by gluing $n$ previous instances $(x^i, k^i, y^i)_{1 \le i \le n}$ as follows: instance $(x^{i+1}, k^{i+1}, y^{i+1})$ is branched
to the left-most leaf of instance $(x^i, k^i, y^i)$. After the first pass, the algorithm is not able to write $n$ closing
tags of the form $\overline{x^i_{k^i}}$. Therefore it needs to store them in order to write them in the second pass. This requires
some formalization that we omit here.

Fact 11. Every randomized streaming algorithm for FCNS decoding with bounded error, one pass on the
input stream, and two passes on the output stream requires space $\Omega(\sqrt{N})$.

A A $\Omega(N/p)$ space lower bound for $p$-pass algorithms for Validity

For the sake of clarity, in this section we provide a proof showing that $p$-pass algorithms require $\Omega(N/p)$ space
for checking validity of arbitrary XML files against arbitrary DTDs. Many space lower bound proofs for
streaming algorithms are reductions to problems in communication complexity [Hl2lll4j. For an introduction
to communication complexity we refer the reader to [13].

Consider a player Alice holding an $N$ bit string $x = x_1 \dots x_N$, and a player Bob holding an $N$ bit string
$y = y_1 \dots y_N$, both taken from a uniform distribution over $\{0, 1\}^N$. Their common goal is to compute the
function $f(x, y) = \bigvee_{i=1}^{N} x[i] \wedge y[i]$ by exchanging messages. This communication problem is the well studied
problem Set-Disjointness (DISJ).

It is well known that the randomized communication complexity with bounded two-sided error of the Set
Disjointness function is $R(\mathrm{DISJ}) = \Theta(N)$. In this model, the players Alice and Bob have access to a common
string of independent, unbiased coin tosses. The answer is required to be correct with probability at least
2/3.

We make use of this fact by encoding this problem into an XML validity problem. Consider $\Sigma = \{r, 0, 1\}$ and
the DTD $D^{DISJ}$ such that $D^{DISJ}(r) = 0r0 \mid 0r1 \mid 1r0 \mid \varepsilon$, $D^{DISJ}(0) = \varepsilon$, and $D^{DISJ}(1) = \varepsilon$. Given an input $x, y$
as above, we construct an input tree $t(x, y)$ as in Figure 14.

Figure 14: $t(x, y)$ is a hard instance for Validity. The root $r$ has children $x_1$, $r$, $y_1$; the node $r$ below it has
children $x_2$, $r$, $y_2$; and so on down to $x_N$, $r$, $y_N$.

Clearly, $\mathrm{DISJ}(x, y) = 0$ if and only if XML(t(x, y)) is valid with respect to $D^{DISJ}$.
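
The equivalence can be checked exhaustively for small inputs with the following Python sketch. It assumes that the innermost node $r$ has no children, as suggested by the tag sequences used in the proof of Theorem 8; the token format `(label, is_closing)` and all function names are assumptions of this example, and the validator is a plain stack-based check rather than a streaming algorithm.

```python
from itertools import product

# Allowed children words of the DTD D^DISJ (each language is finite here).
ALLOWED = {"r": {"0r0", "0r1", "1r0", ""}, "0": {""}, "1": {""}}


def build_instance(x, y):
    """Tags of XML(t(x, y)) as (label, is_closing); Alice's half, then Bob's half."""
    alice = []
    for xi in x:
        alice += [("r", False), (str(xi), False), (str(xi), True)]
    alice.append(("r", False))                 # innermost r, assumed childless
    bob = [("r", True)]
    for yi in reversed(y):
        bob += [(str(yi), False), (str(yi), True), ("r", True)]
    return alice + bob


def is_valid(tags):
    """Check well-formedness and validity against ALLOWED with a stack."""
    stack = []                                 # entries: [label, children word so far]
    for label, is_closing in tags:
        if not is_closing:
            stack.append([label, ""])
            continue
        if not stack:
            return False
        open_label, children = stack.pop()
        if open_label != label or children not in ALLOWED[label]:
            return False
        if stack:
            stack[-1][1] += label
    return not stack


def disj(x, y):
    """f(x, y) = OR_i (x[i] AND y[i])."""
    return int(any(a and b for a, b in zip(x, y)))


agree = all(is_valid(build_instance(x, y)) == (disj(x, y) == 0)
            for x in product((0, 1), repeat=3)
            for y in product((0, 1), repeat=3))
print(agree)
```

The only children word over $\{0, 1\}$ around an inner $r$ that the DTD rejects is $1r1$, which occurs exactly when some index $i$ has $x[i] = y[i] = 1$.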

Theorem 8. Every p-pass randomized streaming algorithm for VALIDITY with bounded error uses Q{N/p) 
space, where N is the input length. 

Proof. Given an instance x G {0, 1}^, y € {0, 1}^ of DISJ, we construct an instance for Validity. Then, 
we show that if there is a p-pass randomized algorithm for Validity using space s with bounded error, then 
there is a communication protocol for DISJ with the same error and communication 0(s -p). This implies 
that any p-pass algorithm for Validity requires space il{N/p) since i?(DISJ) = 8(iV). 

Assume that $A$ is a randomized streaming algorithm deciding validity with space $s$ and $p$ passes. Alice 
generates the first half of $\mathrm{XML}(t(x, y))$, that is $r x_1 \bar{x}_1 r x_2 \bar{x}_2 \dots r x_N \bar{x}_N r$ of length $2N + 2$, and executes algorithm 
$A$ on this sequence using a memory of size $O(s)$. Alice sends the memory of size at most $s$ to Bob via message 
$M_1$, who continues algorithm $A$ on the second half of $\mathrm{XML}(t(x, y))$, that is $\bar{r} y_N \bar{y}_N \dots \bar{r} y_2 \bar{y}_2 \bar{r} y_1 \bar{y}_1 \bar{r}$ of length 
$2N + 2$, using memory $M_1$. After execution, Bob sends the memory of size at most $s$ back to Alice via $M_2$. 
This procedure is repeated at most $p$ times. 
This protocol has a total length of $O(s \cdot p)$, which we know to be $\Omega(N)$ since $R(\mathrm{DISJ}) \in \Omega(N)$. The claim 
follows. □ 
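
The simulation in the proof can be sketched in Python as follows. Everything here is illustrative and assumed for the example: the tag spelling, the names `halves`, `step`, `INIT` and `run_protocol`, and in particular the toy algorithm, which is only correct on inputs of the form $\mathrm{XML}(t(x, y))$ and stores all of $x$. It therefore uses $\Theta(N)$ memory, which is consistent with the bound for $p = 1$. The point of the sketch is the hand-over of the memory state between the two halves of the stream.

```python
def halves(x: str, y: str):
    """Alice's and Bob's halves of XML(t(x, y)) as lists of tags."""
    alice = [t for b in x for t in ("<r>", f"<{b}>", f"</{b}>")] + ["<r>"]
    bob = ["</r>"] + [t for b in reversed(y)
                      for t in (f"<{b}>", f"</{b}>", "</r>")]
    return alice, bob

# Memory of the toy algorithm:
# (stack of x-bits, whether a closing r was seen, verdict so far).
INIT = ((), False, True)

def step(state, tag):
    """Process one tag and return the new memory state."""
    bits, closing, ok = state
    if tag == "</r>":
        return bits, True, ok
    if tag in ("<0>", "<1>"):
        bit = tag[1]
        if not closing:
            return bits + (bit,), closing, ok      # still reading x: push
        if not bits:
            return state                           # nothing left to compare
        # Reading y in reverse order: compare with the matching bit of x.
        return bits[:-1], closing, ok and not (bits[-1] == "1" and bit == "1")
    return state                                   # other tags are ignored

def run_protocol(x: str, y: str, passes: int = 1):
    """Simulate the algorithm split between Alice and Bob.

    Returns (verdict, number of memory states exchanged).
    """
    alice, bob = halves(x, y)
    state, messages = INIT, 0
    for _ in range(passes):
        for tag in alice:                          # Alice runs her half
            state = step(state, tag)
        messages += 1                              # message M_1: Alice -> Bob
        for tag in bob:                            # Bob continues from M_1
            state = step(state, tag)
        messages += 1                              # message M_2: Bob -> Alice
    return state[2], messages
```

For example, `run_protocol("101", "010")` returns `(True, 2)`. In general, $p$ passes exchange $2p$ memory states of size at most $s$ each, which is the $O(s \cdot p)$ communication used in the proof.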



B Conjecture: deciding Validity(2) in $p$ passes requires $\Omega(\sqrt{N}/p)$ space 

In the following, we motivate a conjecture lower-bounding the space for streaming algorithms deciding 
Validity(2). 



Conjecture 2. A $p$-pass streaming algorithm for Validity(2) requires $\Omega(\sqrt{N}/p)$ space. 

For positive integers $m, n$, we define a family of hard instances for Validity(2) of length $N = \Theta(mn)$ as 
in Figure 15. The binary DTD $D$ we consider is: $D(r) = 01 \mid 10$, $D(0) = 01 \mid 10 \mid 11 \mid \epsilon$, $D(1) = 01 \mid 10 \mid 00 \mid \epsilon$. 



[Figure 15 omitted.] 

Figure 15: Top: instance for $m = 1$, where $x$ is an $n$-bit string and $1 \le k \le n - 1$. Bottom: assembly of the 
base case to instances of size $m$. Instances are valid iff ($d_i \neq x_i[k_i]$ or $d_i \neq x_i[k_i + 1]$), for all $1 \le i \le m$. 

Intuitively, for $m = n$ deciding validity in one pass is difficult with space $o(n)$. After reading the sequence 
of opening tags $x_i[1, n]$, the streaming algorithm does not have enough space to store the information about 
the bit at unknown index $k_i$. When it reads bit $d_i$, if $d_i = x_i[k_i + 1]$, it is therefore unable to decide whether 
$d_i = x_i[k_i]$. Moreover, after reading $d_m$, it does not have enough space to store information about all bits 
$x_1[k_1 + 1], x_2[k_2 + 1], \dots, x_m[k_m + 1]$. When it reads the closing tags after $d_m$, if $d_i = x_i[k_i]$, it therefore 
misses out on its second chance to check whether $d_i = x_i[k_i + 1]$ for every $i$. 
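
The validity condition that the algorithm has to decide can be stated as a short Python predicate. This sketch rests on a reading of the caption of Figure 15, namely that an instance is valid iff for every $i$ the bit $d_i$ differs from $x_i[k_i]$ or from $x_i[k_i + 1]$. The name `instance_is_valid` and the representation of the instance as three parallel lists are assumptions made for the example, and the tree encoding itself is not reproduced.

```python
def instance_is_valid(xs: list[str], ks: list[int], ds: list[str]) -> bool:
    """Check the assumed validity condition of the hard instances.

    xs[i] is the n-bit string x_i, ks[i] is the 1-based index k_i with
    1 <= k_i <= n - 1, and ds[i] is the bit d_i ("0" or "1").
    """
    if not (len(xs) == len(ks) == len(ds)):
        raise ValueError("xs, ks and ds must have the same length")
    for x, k in zip(xs, ks):
        if not 1 <= k <= len(x) - 1:
            raise ValueError("each k_i must satisfy 1 <= k_i <= n - 1")
    # x[k - 1] is x_i[k_i] and x[k] is x_i[k_i + 1] in 1-based notation.
    return all(d != x[k - 1] or d != x[k] for x, k, d in zip(xs, ks, ds))
```

With random access to every $x_i$ this check is trivial. The difficulty described above comes from the fact that $k_i$ is revealed only after $x_i$ has been streamed past.
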

Similar to [14, 12, 8], we translate this approach into the language of communication complexity. However, 
unlike in [14, 12, 8], where the resulting communication problem is a two-party problem, the resulting 
communication problem here involves three parties and comes with a technical difficulty. We leave a proof of 
this lower bound as an open problem. 